\(~\)

Day 2 Session 3: Data Visualization

\(~\)

In this session, we will introduce how to visualize our data. There are two major sets of tools for creating plots in R: base R graphics and ggplot2.

\(~\)

We will be focusing on ggplot2 in our class. Because:

\(~\)

Policy advocacy should rely heavily on data. Drawing a figure (i.e., visualizing the data) is often a critical step and can sometimes reveal more than conventional statistical computations. A figure is powerful in itself.

1. The Dataset

\(~\)

For the following examples, we will be using the gapminder dataset. Gapminder is a country-year dataset with information on life expectancy, among other things.

\(~\)

If you have not already installed the gapminder package and you try to load it using the following code, you will get an error:

\(~\)

library(gapminder)
Error in library(gapminder) : there is no package called ‘gapminder’

\(~\)

If this happens, install the gapminder package by running install.packages("gapminder") in your console.
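
The check and the install step can be combined so that a script reports what is missing instead of stopping at library(). This is only a minimal sketch: the names needed, is_installed, and to_install are illustrative, and it assumes you want both packages used in this session.

```r
# Packages used in this session
needed <- c("tidyverse", "gapminder")

# requireNamespace() returns FALSE (rather than raising an error)
# when a package is not installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!is_installed]

if (length(to_install) > 0) {
  # Print the command to run instead of installing silently
  message("Run in the console: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
} else {
  message("All packages for this session are installed.")
}
```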

\(~\)

Once you’ve done this, run the following code to load the gapminder dataset and the tidyverse library, which includes ggplot2:

\(~\)

library(tidyverse)
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.0.2
gap <- gapminder 
head(gap)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

\(~\)

Challenge

\(~\)

Once you load the data, based on what we’ve learned in previous classes, discuss the following questions within your group.

    1. How many countries and continents are there in the data?
    2. What is the time range?
    3. What is the mean life expectancy in Africa?
    4. Please show the GDP per capita of Rwanda over the years.

\(~\)

(Hint: You can also run ?gapminder in the console to open the help file for the data and definitions for each of the columns.)
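
If your group gets stuck, here is one possible way to approach the questions above. It is a sketch rather than the only answer: it assumes the dplyr functions from the earlier sessions, and the object names (n_countries, africa_mean, and so on) are just illustrative.

```r
library(dplyr)
library(gapminder)

gap <- gapminder

# 1. Number of distinct countries and continents
n_countries  <- n_distinct(gap$country)
n_continents <- n_distinct(gap$continent)

# 2. First and last year covered by the data
year_range <- range(gap$year)

# 3. Mean life expectancy across all African country-years
africa_mean <- gap %>%
  filter(continent == "Africa") %>%
  summarise(mean_lifeExp = mean(lifeExp))

# 4. GDP per capita of Rwanda, one row per observed year
rwanda_gdp <- gap %>%
  filter(country == "Rwanda") %>%
  select(year, gdpPercap)

n_countries
n_continents
year_range
africa_mean
rwanda_gdp
```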

\(~\)

2. ggplot2 Grammar

\(~\)

The general call for ggplot2 looks like this:

\(~\)

ggplot(data =, aes(x = , y = )) + 
  geom_xxxx() + 
  geom_yyyy()

\(~\)

The grammar involves some basic components:

    1. Data: a data.frame
    2. Aesthetics: How your data are represented visually, aka its “mapping”. Which variables are shown on x, y axes, as well as color, size, shape, etc.
    3. Geometry: The geometric objects in a plot – histograms, points, lines, smooth lines, etc.
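
To see how the three components fit together, the sketch below builds a plot in two steps and counts its layers. The object names canvas and scatter are illustrative only; the point is that data and aesthetics alone draw no geometry.

```r
library(ggplot2)
library(gapminder)

# Data + aesthetics only: the axes are set up, but nothing is drawn yet
canvas <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))

# Adding a geometry tells ggplot how to draw the mapped variables
scatter <- canvas + geom_point()

length(canvas$layers)   # layers before adding a geom
length(scatter$layers)  # layers after adding geom_point()
```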

\(~\)

The key to understanding ggplot2 is thinking about a figure in layers, just as you might in an image editing program like Photoshop.

\(~\)

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

\(~\)

So the first thing we do is call the ggplot function. This function lets R know that we’re creating a new plot, and any of the arguments we give the ggplot function are the global options for the plot: they apply to all layers on the plot.

\(~\)

For the second argument, we passed in the aes function, which tells ggplot how variables in the data map to aesthetic properties of the figure, in this case the x and y locations. Here we told ggplot we want to plot the gdpPercap column of the gapminder data frame on the x-axis, and the lifeExp column on the y-axis.

\(~\)

Notice that we didn’t need to explicitly pass aes these columns (e.g., x = gap$gdpPercap). This is because ggplot is smart enough to know to look in the data for that column!

\(~\)

Then, we need to tell ggplot how we want to visually represent the data, which we do by adding a new geom layer. In our example, we used geom_point, which tells ggplot we want to visually represent the relationship between x and y as a scatterplot of points:

\(~\)

IMPORTANT: In ggplot, you are adding layers, so you should use + to separate each line of code!


\(~\)

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
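
On the + rule: where the + sits matters. The sketch below shows the working form and, commented out, the form that does not do what you want. It uses the gapminder data frame directly so that it runs on its own.

```r
library(ggplot2)
library(gapminder)

# Works: the trailing + tells R that the expression continues on the next line
p <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# Does not work as intended: R treats the first line as a complete expression
# (an empty plot), then fails on the second line because it starts with +
# ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))
#   + geom_point()
```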

\(~\)

Challenge

\(~\)

  1. Modify the example so that the figure visualizes how life expectancy has changed over time:

\(~\)

3. Anatomy of aes

\(~\)

In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color.

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point()

Next, we can add a layer to set the colors manually. You can also search online for R color names and palettes to find specific color codes.

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point() +
  scale_color_manual(values = c("gold", "lightblue", "red", "lightgreen", "pink"))
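
With an unnamed vector, the colors are matched to the continents in the order of the factor levels. A slightly safer variant, sketched below, names each color so the pairing is explicit. The vector name continent_colors is illustrative.

```r
library(ggplot2)
library(gapminder)

# Each name must match a level of the continent factor
continent_colors <- c(
  Africa   = "gold",
  Americas = "lightblue",
  Asia     = "red",
  Europe   = "lightgreen",
  Oceania  = "pink"
)

p <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_color_manual(values = continent_colors)
```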

Furthermore, you can change the opacity of the points with the alpha argument of geom_point. alpha ranges from 0 (fully transparent) to 1 (fully opaque).

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.5)

Color isn’t the only aesthetic we can use to display variation in the data. We can also vary shape, size, etc. For example, we can map shape to continent as well.

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent, shape = continent)) + 
  geom_point(alpha = 0.5)
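
A point that often causes confusion is the difference between mapping an aesthetic to a variable (inside aes) and setting it to a fixed value (outside aes). The sketch below contrasts the two; the color "steelblue" and the object names are arbitrary choices for illustration.

```r
library(ggplot2)
library(gapminder)

# Mapping: color depends on a variable, so it goes inside aes() and gets a legend
mapped <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.5)

# Setting: one fixed color for every point, given outside aes(); no legend is drawn
fixed <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "steelblue", alpha = 0.5)
```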

\(~\)

4. Layers

\(~\)

In the previous challenge, you plotted lifeExp over time. A scatterplot probably isn’t the best way to visualize change over time. Instead, let’s tell ggplot to visualize the data as a line plot:

ggplot(data = gap, aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line()

Instead of adding a geom_point layer, we’ve added a geom_line layer. We’ve also added the group aesthetic, which tells ggplot to draw a separate line for each country.

\(~\)

But what if we want to visualize both lines and points on the plot? We can simply add another layer to the plot:

ggplot(data = gap, aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line() +
  geom_point()

It’s important to note that each layer is drawn on top of the previous layer. In this example, the points have been drawn on top of the lines. Here’s another demonstration:

ggplot(data = gap, aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(color = continent)) +
  geom_point()

In this example, the aesthetic mapping of color has been moved from the global plot options in ggplot to the geom_line layer so it no longer applies to the points. Now we can clearly see that the points are drawn on top of the lines.
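
To see why the per-country grouping in the line plots matters, the sketch below compares a grouped and an ungrouped line plot on a small subset. It uses group, which is ggplot2's documented grouping aesthetic; the three countries and the object names are arbitrary choices for illustration.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# A small subset keeps the difference easy to see
three <- gapminder %>%
  filter(country %in% c("Rwanda", "Kenya", "Uganda"))

# Without grouping: a single line zigzags through the points of all three countries
ungrouped <- ggplot(data = three, aes(x = year, y = lifeExp)) +
  geom_line()

# With grouping: one separate line per country
grouped <- ggplot(data = three, aes(x = year, y = lifeExp, group = country)) +
  geom_line()
```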

\(~\)

Challenge

\(~\)

  1. Switch the order of the point and line layers from the previous example. What happened?

\(~\)

5. Labels and Themes

\(~\)

Labels are considered to be their own layers in ggplot. You can use labs(x = , y = , title = ) to set your labels.

# add x and y axis labels
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color=continent)) + 
  geom_point(alpha = 0.5) + 
  labs(x = "GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, by Continent")

You can also modify the theme of your plots. The themes in ggplot include theme_bw(), theme_classic(), theme_light(), theme_void(), etc. I recommend theme_few() from the ggthemes package, which needs to be installed and loaded separately.

# load ggthemes (install it first if needed) and apply theme_few()
library(ggthemes)

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.5) +
  labs(x = "GDP per capita (in US$)", y = "Life Expectancy (in years)",
       title = "Relations of Life Expectancy and Economic Development, by Continent") +
  theme_few()
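
When comparing themes, it can be convenient to store the plot once and add a different theme each time. This is a sketch of that workflow; the object names are illustrative, and the ggthemes step is skipped if that package is not installed.

```r
library(ggplot2)
library(gapminder)

# Build the plot once, without a theme
base_plot <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.5) +
  labs(x = "GDP per capita (in US$)", y = "Life Expectancy (in years)")

# Add a different built-in theme to the same stored plot
bw_version      <- base_plot + theme_bw()
classic_version <- base_plot + theme_classic()

# theme_few() comes from ggthemes, so only use it if that package is available
few_version <- if (requireNamespace("ggthemes", quietly = TRUE)) {
  base_plot + ggthemes::theme_few()
} else {
  NULL
}

# Print any of them to view it, e.g. print(bw_version)
```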

\(~\)

Challenge

\(~\)

  1. Try different themes in ggplot, and discuss in your group which one you prefer.

\(~\)

6. Transformations and Statistics

\(~\)

In ggplot, we can change the scale of units on the x-axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic.

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.5) + 
  scale_x_log10() + # plot the x-axis on a log10 scale
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, by Continent") +
  theme_few()

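We can also apply the transformation manually in the global aesthetic mapping. Note that log() takes the natural log rather than log10, so the values on the axis differ. For example,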

# Here I take the natural log transformation on GDP per capita
ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.5) + 
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, by Continent") +
  theme_few()
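
The two approaches are not identical. scale_x_log10() keeps the data in original units and only changes the spacing of the axis, while log() changes the plotted values themselves, and it uses base e rather than base 10. A quick sketch of the arithmetic, using made-up GDP values:

```r
# Illustrative GDP per capita values (not taken from the data)
x <- c(100, 1000, 10000)

log10(x)  # base-10 log: 2 3 4
log(x)    # natural log: the same values multiplied by log(10)

# The two differ only by a constant factor, so the shape of the plot is the same;
# only the numbers shown on the axis change
all.equal(log(x), log10(x) * log(10))
```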

Lastly, you can use the data wrangling functions that we just learned to choose the data we want to plot. For example, suppose we only care about the data on African countries in 2007.

# filter by all African countries in 2007
gap %>%
  filter(continent == "Africa" & year == 2007) %>% 
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm") + # we can even add a regression line
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development in Africa, 2007") +
  theme_few()

Pay attention here: when we use dplyr and pipes, we use %>% to chain the steps; in ggplot, however, we use + to add layers!
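
If switching between %>% and + in one chain feels error-prone, one alternative is to store the wrangled data first and then start the plot from that object. A minimal sketch, where africa_2007 is an illustrative name:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Step 1 (dplyr, chained with %>%): keep African countries in 2007
africa_2007 <- gapminder %>%
  filter(continent == "Africa", year == 2007)

nrow(africa_2007)  # one row per African country

# Step 2 (ggplot2, layers added with +): plot the stored data
p <- ggplot(data = africa_2007, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.5)
```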

You can also change the shape of particular points, for example to highlight a single country.

# we want to highlight Rwanda
gap %>%
  filter(continent == "Africa" & year == 2007) %>% 
  mutate(rwanda = ifelse(country == "Rwanda", "Rwanda", "Others")) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(aes(shape = rwanda, color = rwanda), alpha = 0.5) + 
  geom_smooth(method = "lm", show.legend = FALSE) +
  scale_color_manual(values = c("black", "red")) +
  scale_shape_manual(values = c(16, 8)) +
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development in Africa") +
  theme_few()

\(~\)

Challenge

  1. Can you replicate these figures?

7. Facets

Previously, we visualized the change in life expectancy over time across all countries in one plot. Alternatively, we can split this out over multiple panels by adding a layer of facet panels.

\(~\)

facet_wrap() is a useful tool to display patterns for different groups. For example:

ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.5) + 
  facet_wrap(~ continent) +
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, by Continent") +
  theme_few()

\(~\)

If we would like to compare the five continents side by side in a single row, we can use ncol or nrow to set the number of columns or rows in which the facets are arranged.

\(~\)

ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.5) + 
  facet_wrap(~ continent, ncol = 5) +
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, by Continent") +
  theme_few()
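
Since there are five continents, asking for one row is another way to get the same side-by-side layout, and asking for one column stacks the panels instead. A brief sketch, with illustrative object names and without the labels and theme so that it runs on its own:

```r
library(ggplot2)
library(gapminder)

base <- ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
  geom_point(alpha = 0.5)

# One row of panels: with five continents this matches ncol = 5
one_row <- base + facet_wrap(~ continent, nrow = 1)

# One column of panels: the five continents are stacked vertically
one_col <- base + facet_wrap(~ continent, ncol = 1)
```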

Let’s go back to the Rwanda example. Let’s facet_wrap by year.

gap %>%
  filter(continent == "Africa") %>% 
  mutate(rwanda = ifelse(country == "Rwanda", "Rwanda", "Others")) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(aes(shape = rwanda, color = rwanda), alpha = 0.5) + 
  geom_smooth(method = "lm") +
  facet_wrap(~year) +
  scale_color_manual(values = c("black", "red")) +
  scale_shape_manual(values = c(16, 8)) +
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development in Africa") +
  theme_few()

\(~\)

8. Legends

\(~\)

Legends are more complicated than axes because:

\(~\)

    1. A legend can display multiple aesthetics (e.g., color and shape), from multiple layers, and the symbol displayed in a legend varies based on the geom used in the layer.
    2. Axes always appear in the same place. Legends can appear in different places, so you need some global way of controlling them.
    3. Legends have considerably more details that can be tweaked: should they be displayed vertically or horizontally? How many columns? How big should the keys be?

\(~\)

By default, a layer will only appear in the legend if the corresponding aesthetic is mapped to a variable with aes(). You can override this with the show.legend argument: FALSE prevents a layer from ever appearing in the legend, while TRUE forces it to appear when it otherwise wouldn’t.

\(~\)

gap %>%
  filter(continent == "Africa") %>% 
  mutate(rwanda = ifelse(country == "Rwanda", "Rwanda", "Others")) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(aes(shape = rwanda, color = rwanda), alpha = 0.5, show.legend = FALSE) + # HERE!
  geom_smooth(method = "lm") +
  facet_wrap(~year) +
  scale_color_manual(values = c("black", "red")) +
  scale_shape_manual(values = c(16, 8)) +
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development in Africa") +
  theme_few()
## `geom_smooth()` using formula = 'y ~ x'

\(~\)

You can also change the location of the legend with the theme() function. The position of the legend is controlled by the theme setting legend.position, which takes the values “right”, “left”, “top”, “bottom”, or “none” (no legend).

\(~\)

gap %>%
  filter(continent == "Africa") %>% 
  mutate(rwanda = ifelse(country == "Rwanda", "Rwanda", "Others")) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(aes(shape = rwanda, color = rwanda), alpha = 0.5) + 
  geom_smooth(method = "lm") +
  facet_wrap(~year) +
  scale_color_manual(values = c("black", "red")) +
  scale_shape_manual(values = c(16, 8)) +
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development in Africa",
       color = "", shape = "") +
  theme_few() +
  theme(legend.position = "bottom") # position
## `geom_smooth()` using formula = 'y ~ x'
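
The other legend.position values work the same way. The sketch below reuses a simplified version of the Rwanda plot (2007 only, without the smoother or facets, to keep it short) and stores two variants; the object names are illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Simplified Rwanda plot: African countries in 2007, Rwanda flagged
base <- gapminder %>%
  filter(continent == "Africa", year == 2007) %>%
  mutate(rwanda = ifelse(country == "Rwanda", "Rwanda", "Others")) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(aes(shape = rwanda, color = rwanda), alpha = 0.5)

# Legend above the plot
legend_top <- base + theme(legend.position = "top")

# No legend at all
legend_none <- base + theme(legend.position = "none")
```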